The ONT hands-on part includes four tutorials, which follow the Illumina tutorials 1-4:
Figure 1: Assembly workflow
In the following, we will go through Tutorial 5 Raw read analysis and filtering.
Before we start with ONT, make sure you have completed the Illumina stats in the Google Sheets here.
In order to get a rough estimate of the total genome assembly length and coverage, have a look at your coverage plot in the #results channel on Slack.
In the example shown above, the assembly has an approximate read coverage of 100x and the estimated genome size is about 2.5 Mb.
You will need that information later but it does not have to be completely precise.
Now, we will start working with the Oxford Nanopore Technologies (ONT) data that was sequenced on a GridION. To find out more about ONT, have a look here.
This has already been done for you. Nevertheless, for the sake of completeness, we will quickly elaborate on it.
Whenever we sequence, we first need to download the data to our local machine or to a server. This usually takes some time, as the amount of data is large. The data will also most likely be compressed into a smaller format to save storage space and transfer time. Downloading and extracting files can be troublesome, so it is always good practice to check that these two steps worked smoothly.
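A common way to check that a transfer worked is to compare checksums; sequencing facilities usually provide an .md5 file alongside the archive. Here is a minimal, self-contained sketch with a throwaway file standing in for the real archive (all file names hypothetical):

```shell
# Stand-in for the real archive and its provider-supplied checksum file
echo "demo sequencing data" > sample.tar.gz
md5sum sample.tar.gz > sample.tar.gz.md5   # normally shipped by the facility
# After the transfer, verify the file is intact:
md5sum -c sample.tar.gz.md5                # prints "sample.tar.gz: OK" if so
```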
This has already been done for you. Nevertheless, for the sake of completeness, we will quickly elaborate on it.
Large genomic files are often compressed with a format called “tar.gz”. This is very similar to the common “zip” compression (.zip). However, it is more efficient for genomic data. These files can easily be extracted with the following command:
cd ${personal_home_directory}/rawReads/
tar -xzvf ${sample_name}.tar.gz -C .

To find out more about the “tar” function, you can look at the help info of the tar command, or go online.
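It can also be worth listing an archive's contents before extracting it; a small self-contained sketch with throwaway files (not the course data):

```shell
# Build a tiny throwaway archive to demonstrate on
mkdir -p demo && echo "ACGT" > demo/reads.txt
tar -czf demo.tar.gz demo
# -t lists the archive contents without extracting anything
tar -tzf demo.tar.gz
# -x extracts, as in the command above; -C . sets the output directory
tar -xzf demo.tar.gz -C .
```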
This has already been done for you. Nevertheless, for the sake of completeness, we will quickly elaborate on it.
Remember from the lecture of Alban Ramette that the signal received from ONT data is actually an electrical current that is associated with the different nucleotides passing through the channel. From this we get fastq data, which includes a corresponding quality value.
The first step when working with ONT data is to transform the raw read signal into a fastq file. There are many different basecallers, and the technology is advancing rapidly. Base calling is a very computing-intensive step, so we cannot repeat it every time a new basecaller is published. For the SAGE course, the base calling has already been performed with the default basecaller (Guppy 3.4) implemented on the GridION machine. Therefore, we do not have to perform this step; we have already received the data as fastq files.
This has already been done for you. Nevertheless, for the sake of completeness, we will quickly elaborate on it.
Just as for the Illumina sequencing technology, Oxford Nanopore Technologies also uses unique barcodes to label the different samples, so that multiple samples can be run on one flowcell. The de-multiplexing is done during base calling, so we do not have to do this step either. All indexes and barcodes have also been removed.
Let’s get started with exploring your ONT data! The data is located in the common_files directory and labeled according to your ESL0* number.
cd /scratch/jgianott/sage/SAGE2021_22/common_files/raw_ONT_data/

You can quickly check the file by looking at it:
less <ESL0xxx>.fastq.gz

and check the number of reads:
zgrep -c '@' <ESL0xxx>.fastq.gz

Can you copy your ONT file to your personal location?
This is one option of how the code could look (IMPORTANT: change the username and sample name variables!):
###================================================
#Set your bash variables
###================================================
username=<username>
personal_home_directory=/users/${username}/scratch_link/${username}/
raw_data_directory=/scratch/jgianott/SAGE/SAGE2022_2023/common_files/raw_ONT_data
sample_name=<ESL0xxx>
###================================================
#Copying file
###================================================
mkdir -p ${personal_home_directory}/rawReads_ONT/
cp ${raw_data_directory}/${sample_name}.fastq.gz ${personal_home_directory}/rawReads_ONT/${sample_name}.fastq.gz

Obviously, this step can also be performed ‘manually’ by typing both commands one after the other on the command line. But if you plan to analyze many ONT datasets in parallel, even a simple task like copying files should be written in a script so that it can be applied repeatedly. This is the final script:
#!/bin/bash
####--------------------------------------
##SLURM options
####--------------------------------------
#SBATCH --job-name copying_fastq_ONT
#SBATCH --account jgianott_sage
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 2
#SBATCH --mem 6G
#SBATCH --time 1:00:00
#SBATCH --output /users/<username>/scratch_link/<username>/logs/%x_%j.out
#SBATCH --error /users/<username>/scratch_link/<username>/logs/%x_%j.err
####--------------------------------------
##preparation
##set your bash variables in order to quickly call them in the script
####--------------------------------------
username=<username>
personal_home_directory=/users/${username}/scratch_link/${username}/
raw_data_directory=/scratch/jgianott/SAGE/SAGE2022_2023/common_files/raw_ONT_data
sample_name=<ESL0xxx>
####--------------------------------------
#Copying file
####--------------------------------------
mkdir -p ${personal_home_directory}/rawReads_ONT/
cp ${raw_data_directory}/${sample_name}.fastq.gz ${personal_home_directory}/rawReads_ONT/${sample_name}.fastq.gz

What information can we use to assess the quality of the raw read data?
What is contained in the second and fourth line of the fastq file?
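As a quick reminder of the FASTQ layout, here is a minimal sketch with one hypothetical record; sed prints its second and fourth lines:

```shell
# One FASTQ record: header, sequence, separator, per-base quality
printf '@read1\nACGT\n+\nIIII\n' | sed -n '2p;4p'
# line 2 is the sequence (ACGT); line 4 is the quality string (IIII)
```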
With ONT data, we are primarily interested in the read length. Read quality is only secondary, especially as we have high-quality Illumina data which we will use later to correct the errors in the ONT data. Hence, we first want to check the distribution of the sequence lengths of our ONT reads.
What is the read length distribution of the raw reads? In order to analyze this, we will extract all the read lengths and then move to R to evaluate them in the next chapter.
We have only quickly talked about “awk” in the previous introduction. Awk is a very versatile bash tool, and you will find lots of information on the internet (e.g. here). Have a look at the following lines of code, discuss within your group to make sure you understand what is going on (see hints if you need help), and write a script to do the calculations on your raw read data.
mkdir -p ${personal_home_directory}/Oxford_Nanopore_analysis/log/
##each read length
less ${personal_home_directory}/rawReads_ONT/${sample_name}.fastq.gz | awk '{if(NR%4==2) print length($1)}' > ${personal_home_directory}/Oxford_Nanopore_analysis/log/${sample_name}_raw_readLength.txt

if(NR%4==2): a fastq file consists of blocks of four lines per read; this condition selects only the second line of each block, which contains the sequence.
print length($1): for every selected sequence line, print its length.
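To see the awk selection in action, you can run it on a tiny FASTQ built on the fly (two hypothetical reads of length 4 and 5):

```shell
# NR%4==2 selects the second line of every four-line record (the sequence)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGCA\n+\nIIIII\n' \
  | awk '{if(NR%4==2) print length($1)}'
# prints 4 and 5, the lengths of the two sequences
```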
mkdir -p ${personal_home_directory}/Oxford_Nanopore_analysis/log/
##each read length
less ${personal_home_directory}/rawReads_ONT/${sample_name}.fastq.gz | awk '{if(NR%4==2) print length($1)}' > ${personal_home_directory}/Oxford_Nanopore_analysis/log/${sample_name}_raw_readLength.txt

Now that you have understood the commands, you can write a SLURM script to run them on Wally.
This is how your script could look:
#!/bin/bash
####--------------------------------------
##SLURM options
####--------------------------------------
#SBATCH --job-name readLength_rawReads
#SBATCH --account jgianott_sage
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 8
#SBATCH --mem 10G
#SBATCH --time 1:00:00
#SBATCH --output /users/<username>/scratch_link/<username>/logs/%x_%j.out
#SBATCH --error /users/<username>/scratch_link/<username>/logs/%x_%j.err
####--------------------------------------
##preparation
##set your bash variables in order to quickly call them in the script
####--------------------------------------
username=<username>
personal_home_directory=/users/${username}/scratch_link/${username}
raw_data_directory=/scratch/jgianott/sage/SAGE2021_22/common_files/raw_ONT_data
sample_name=<ESL0xxx>
####--------------------------------------
##modules
####--------------------------------------
###NONE Needed!!
####--------------------------------------
##Calculate raw read statistics
echo -e "1. First calculate the read length"
####--------------------------------------
mkdir -p ${personal_home_directory}/Oxford_Nanopore_analysis/log/
less ${personal_home_directory}/rawReads_ONT/${sample_name}.fastq.gz | awk '{if(NR%4==2) print length($1)}' > ${personal_home_directory}/Oxford_Nanopore_analysis/log/${sample_name}_raw_readLength.txt

Have a quick look at the ${sample_name}_raw_readLength.txt file to make sure it is what you want. In a next step, we want to move the files to our computer in order to analyse the read length distribution in R. Remember, this would look something like this (don’t forget to change your username and location):
scp mndiaye1@curnagl.dcsr.unil.ch:/users/mndiaye1/scratch_link/mndiaye1/Oxford_Nanopore_analysis/log/*readLength.txt /home/mam/SAGE_test/log/

Next, we will read the data into R and make a read length distribution plot. Write the following code as a new R script in RStudio, and save it as ONT_readLength_distribution.R
library(tidyverse)
library(ggplot2)
rawdata_readLength <- read_csv("/home/mam/SAGE_test/log/ESL0976_raw_readLength.txt", col_names = "read_length") %>% arrange(., read_length)

Now that we have read the data into R, we can answer the following questions:
When you have collected this information, you can add it to the Google Sheets here. Also discuss in your group where the differences between the samples might come from.
number_reads <- nrow(rawdata_readLength)
mean_read <- round(mean(rawdata_readLength$read_length),digits = 0)
longest_read <- max(rawdata_readLength$read_length)
print(paste0("The sample has ",number_reads, " reads. The mean read length is ",mean_read, " and the longest read is ",longest_read))

## [1] "The sample has 25487 reads. The mean read length is 1779 and the longest read is 53155"
In order to get an understanding of the read distribution, we can try to answer the following questions:
#install the patchwork package if necessary
#install.packages("patchwork")
library(patchwork)
##-------------------------1. Histogram
histoPlot_01 <- ggplot(rawdata_readLength,aes(x=read_length))+
geom_histogram(binwidth = 500)+theme_classic()+
geom_vline(xintercept = 5000,color="blue")+
lims(x=c(-250,longest_read))
histoPlot_01

##-------------------------2. Here we calculate how many bases correspond to half of the sequencing output
rawdata_readLength$Cumsum_forward <- cumsum(rawdata_readLength$read_length)
rawdata_readLength$Cumsum_reverse <- max(rawdata_readLength$Cumsum_forward) - cumsum(rawdata_readLength$read_length)
Half_sequencing_output <- max(rawdata_readLength$Cumsum_forward)/2
print(paste0("The total number of sequenced bases is ",max(rawdata_readLength$Cumsum_forward)," and half of that is ",Half_sequencing_output))

## [1] "The total number of sequenced bases is 45332795 and half of that is 22666397.5"
##-------------------------3. Here we plot the histogram and the reverse cumulative sum
cumSumPlot_01 <- ggplot(rawdata_readLength,aes(x=read_length,y=Cumsum_reverse))+
geom_point()+theme_classic()+
geom_vline(xintercept = 5000,color="blue")+
geom_hline(yintercept = Half_sequencing_output,color="red")+
labs(y="Reverse cumulative read sum")+
lims(x=c(-250,longest_read))+
theme(axis.title.x = element_blank(),axis.text.x = element_blank())
cumSumPlot_01+histoPlot_01+plot_layout(nrow=2,heights = c(1,2))

Generally, there are three important features to consider when filtering ONT data:
In the following we will try to find the optimal amount and quality of our raw reads to use for the genome assembly.
##-------------------------1. total sequenced bases
print(paste0("The total number of sequenced bases is ",max(rawdata_readLength$Cumsum_forward)))

## [1] "The total number of sequenced bases is 45332795"
##-------------------------2. Here we calculate how many bases are needed for the target coverage
sample_name="ESL0976"
genome_size=2000000
expected_genome_coverage=50
bases_needed <- genome_size*expected_genome_coverage
print(paste0("The total number of bases needed for ",expected_genome_coverage,"x coverage of a ",genome_size," bp genome are ",bases_needed))

## [1] "The total number of bases needed for 50x coverage of a 2e+06 bp genome are 1e+08"
##-------------------------3. Here we plot the histogram and the reverse cumulative sum
cumSumPlot_01 <- ggplot(rawdata_readLength,aes(x=read_length,y=Cumsum_reverse))+
geom_point()+theme_classic()+
geom_vline(xintercept = 5000,color="blue")+
geom_hline(yintercept = bases_needed,color="red")+
labs(y="Reverse cumulative read sum",title = sample_name)+
lims(x=c(0,longest_read))+
theme(axis.title.x = element_blank(),axis.text.x = element_blank())
cumSumPlot_01+histoPlot_01+plot_layout(nrow=2,heights = c(1,2))

When you have created this graph, we will meet in plenum to discuss the data. In order to have a good discussion, please post your plots to the #results channel.
In the meantime, here are some questions to think about.
Try to find answers, fill in the Google Sheets, and discuss in the group.
We can now start filtering the ONT data. We do this with the filtlong tool. We can use a number of different parameters with filtlong. Check them out!
module load gcc
module load filtlong
filtlong --help

The parameters of interest are:
filtlong -t <keep only the best reads up to this many total bases> \
--min_length <minimum length threshold> \
--min_mean_q <minimum mean quality threshold> \
--length_weight <weight given to the length score>

target_bases=200000000 #variable for parameter -t
MINIMUM_read_LENGTH=5000 #variable for parameter --min_length
min_mean_q_CHANGED=10 #variable for parameter --min_mean_q
length_weight=10 #variable for parameter --length_weight

If you have very short read lengths, then adjust the variables accordingly, e.g. like this:
target_bases=200000000 #variable for parameter -t
MINIMUM_read_LENGTH=1000 #variable for parameter --min_length
min_mean_q_CHANGED=10 #variable for parameter --min_mean_q
length_weight=10 #variable for parameter --length_weight

Here is how the full script could look:
#!/bin/bash
####--------------------------------------
##SLURM options
####--------------------------------------
#SBATCH --job-name readFiltering
#SBATCH --account jgianott_sage
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 8
#SBATCH --mem 8G
#SBATCH --time 2:00:00
#SBATCH --output /users/<username>/scratch_link/<username>/logs/%x_%j.out
#SBATCH --error /users/<username>/scratch_link/<username>/logs/%x_%j.err
####--------------------------------------
##preparation
##set your bash variables in order to quickly call them in the script
####--------------------------------------
username=<username>
personal_home_directory=/users/${username}/scratch_link/${username}
sample_name=<ESL0xxx>
##!!!!!!!here you can adjust the parameters
##see how these bash variables are used in the script further below
MINIMUM_read_LENGTH=1000
min_mean_q=10
length_weight=10
target_bases=400000000
####--------------------------------------
##modules
####--------------------------------------
source /dcsrsoft/spack/bin/setup_dcsrsoft
module load gcc
module load filtlong
####--------------------------------------
##Filter reads
echo -e "1. First we filter the reads"
####--------------------------------------
mkdir -p ${personal_home_directory}/01_data/After_filtlong_trimming/
filtlong --min_length ${MINIMUM_read_LENGTH} \
--min_mean_q ${min_mean_q} \
--length_weight ${length_weight} \
--target_bases ${target_bases} \
${personal_home_directory}/rawReads_ONT/${sample_name}.fastq.gz > \
${personal_home_directory}/rawReads_ONT/${sample_name}_filtered.fastq
less ${personal_home_directory}/rawReads_ONT/${sample_name}_filtered.fastq | awk '{if(NR%4==2) print length($1)}' > ${personal_home_directory}/Oxford_Nanopore_analysis/log/${sample_name}_filtered_readLength.txt

Now the reads should be filtered. Check the output files:
##check your error file (replace %x and %j with the actual job name and job ID;
##%x and %j are only expanded inside #SBATCH directives, not on the command line)
less /users/<username>/scratch_link/<username>/logs/readFiltering_<jobID>.err
##check your output file
less /users/<username>/scratch_link/<username>/logs/readFiltering_<jobID>.out

and compare the number of reads in the raw read file and the filtered file:
zgrep -c '@' <ESL0xxx>.fastq.gz
grep -c '@' <ESL0xxx>_filtered.fastq

So far, we have assessed the quality of our data and filtered the reads based on sequence length, sequence quality, and sequencing depth. We are now ready to de novo assemble the genome from the filtered ONT reads. Let’s move on to Tutorial 6 for this task!
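One caveat about the grep counts above: '@' is also a valid character in FASTQ quality strings, so counting lines that contain '@' can overcount the reads. A more robust sketch counts lines and divides by four, and then computes the fraction of reads kept (the raw/filtered counts below are hypothetical stand-ins for your own numbers):

```shell
# Tiny demo FASTQ whose quality strings also contain '@' characters
printf '@r1\nACGT\n+\n@@@@\n@r2\nTTGCA\n+\nII@II\n' > demo_filtered.fastq
grep -c '@' demo_filtered.fastq            # 4: quality lines match too
awk 'END{print NR/4}' demo_filtered.fastq  # 2: one record = four lines

# Hypothetical raw/filtered read counts; awk does the floating-point division
raw_reads=25487
filtered_reads=9800
awk -v r="$raw_reads" -v f="$filtered_reads" \
  'BEGIN{printf "%.1f%% of the reads were kept\n", 100*f/r}'
```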